Chicago Bulls Prospective Player Analysis
2019-20
1. Introduction:
This interactive website is the tangible report component of a reproducible data analysis project. The project is based upon a fictitious task given to the data analytics team for the Chicago Bulls NBA organisation to provide a prospective player analysis report for the General Manager to help rebuild the team for next season. The task detailed the assessment of potential players to join/retain for the Chicago Bulls organisation for the 2019-20 NBA season.
2. Report scenario:
This project is based around the “Moneyball” theory of using statistical analysis to provide a greater insight into sport performances, in this case the selection/purchase of players from the 2018-19 season of the NBA who would help produce greater results for the Chicago Bulls organisation to improve on their past season result (finishing 13th in the Eastern Conference, and 27th overall on Win-Loss ratio) and provide an improved result for the upcoming 2019-2020 NBA season.
The assigned task included the following:
- The assessment of potential players to purchase or retain for the Chicago Bulls organisation for the 2019-20 NBA season.
- Projection of expected results with selected players.
- Selection of 5 players, one from each position
- Center = C
- Power Forward = PF
- Small Forward = SF
- Shooting Guard = SG
- Point Guard = PG
- Ensure purchase of the 5 players was within the allotted budget of $118 million.
- The proposed purchases must allow enough budget to still field the other remaining players required for an NBA team (NBA teams are allowed 15 players total).
The use of statistics in sport is not a new phenomenon1 2, partly due to the influence of Bill James and John Hollinger3, who pioneered and popularised the statistical analysis that is now common in sports such as basketball, and in particular in the NBA, the North American professional basketball league. John Hollinger created the all-in-one metric known as the Player Efficiency Rating (PER), which combines several variables, covering both positive and negative outcomes (e.g. points, turnovers, missed free throws, personal fouls), into a single indicator of player performance that can be used as a reliable measure both between players and for the same player over time.
Project aim:
The hypothesis for this project is based on the use of a combination of known analysis methods/variables to create a predictive equation to aid in the selection of appropriate players for the Chicago Bulls 19/20 season in the NBA.4
By selecting players who exceeded specific values in the selected key metrics, it is hypothesized that an increase in points/min could be achieved, which is associated with an increased win percentage.
As such, the project aims to show that choosing players who contribute to increasing the team points per minute average would equate to increased team wins, with a goal of achieving greater than 42 wins/50% win percentage and progressing to playoffs5.
The purpose and problem that this method of analysis addresses is a way to see through the inflated market values for athletes and highlight the true value of players based on their repeated habits and trends of play. I believe that the predictive formula below can provide valuable insight into the real value and contribution players are making/could make in a new team.
Positions and key metrics used in the NBA
Basketball has 5 positions. Although the roles are fixed, there is some variation to the roles, and it is common for some players to play across two positions, depending on the team/other members of the team.
Positions and key roles to look at:
- PF = Power Forward
- Offence = Playing near to the basket, rebounding.
- Does have shooting role - 2P > 3P
- Defence = Defending taller players and rebounds.
- C = Center
- Offence = Tries to score on close shots and gather offensive rebounds.
- Predominantly 2P
- Defence = Tries to block opponents’ shots and rebound their misses.
- PG = Point Guard
- Offence = Runs plays, shooter, passer, dribbler.
- Good shooter 3P > 2P
- Defence = Looks to steal from the opposing PG
- SG = Shooting Guard
- Offence = Predominantly a shooter, dribbler and passer 3P
- Defence = Steals and blocks
- SF = Small Forward
- Offence = Plays within the key - Shoots regularly - close and far. Universal player
- Defence = Universal role
As the game of basketball has evolved, so have the tools used to measure a player's performance. The list of usable metrics recorded from a standard NBA game is long, and each interpretable variable has had its time in the limelight.
Variables/Metrics targeted in the analysis:
The variables selected were used to show an association with an increase in overall win percentage due to an increase of points per minute played.
The variables used for the predictive value were:
- Effective Field Goal Percentage (eFGp)
- Trade Value (TrV)
- Efficiency rate (EFF)
- Usage Rate (Tm_use_total)
- Total Rebounds per minute (TRB_MP)
- Points per minute (PTS_per_MP)
Dean Oliver6 refers to the “Four Factors” of Basketball adding that metrics/ratings can be broken down into four elements of the game: shooting, turnovers, rebounding, and getting to the foul line. It is in this framework that I believe that using a multifaceted approach to the analysis of player performance decreases the disparity between observed results and predictive results.
Points per minute:
\[
\beta_1 = -0.382 + 0.699 * eFGp - 0.0330 * TRB\_MP + 2.39 * Tm\_use\_total + 0.00000965 * EFF - 0.00000803 * TrV
\]
Justification and importance:
The previous 2018-19 season saw the Chicago Bulls finish 27th out of 30 teams in the NBA (on win-loss record). The Chicago Bulls organisation has aspirations to rebuild their line-up and field a team with championship title potential for the upcoming 2019-20 season.
Background on variables:
By balancing statistical variables such as usage rates, efficiency ratings and other varying offensive/defensive ratings of the five players on a basketball court, a team can achieve optimal offensive output. This can be seen through repeated game stats and team habits on the court7.
Interestingly, the trends show that, for all players, as a player uses more possessions, his efficiency decreases. In the eyes of some statistical analysts, what defines a superstar is someone who can carry a larger proportion of a team's possessions and produce points with only a relatively small drop in efficiency. Meanwhile, the opposite is also true: players perform more efficiently when they are asked to use fewer of their team's possessions. As a result, the greater burden on the superstar means that supporting players maintain low usage rates, allowing them to operate closer to their peak efficiency.
In an effort to determine how much impact players have on their teams, sports statisticians have developed metrics such as Usage Percentage. Examining Usage Percentage gives us an indication of how efficient a player is given the amount of possessions he uses.8
What defines a quality player is someone who can have a high Usage Percentage, but still plays at a high rate of efficiency. Teams can look at the Usage Percentage of players on their team, and determine how to balance usage across their lineup to maximize team efficiency.9
As with other combination metrics within sport, the predictive formula may look complicated at first glance, but the basic idea is to look at a player’s combination of independent and dependent metrics and find the percentage of the team totals he uses in those same categories.3
Relevant calculations
The following calculations were used within the analysis process and are referred to regularly throughout this report.
Usage rate equation (Tm_use_total)
Usage rate, or usage percentage, is an estimate of the percentage of team plays used by a player while he was on the floor. The basis of the formula is to look at a player’s combination of field goal attempts, free throw attempts and turnovers, and find the percentage of the team totals he accounts for in those same categories.
It is calculated by:
\[ Usage(\%) = 100* \frac{((FGA+0.44*FTA+TOV)*(TM\_MP/5))}{(MP*(TM\_FGA+0.44*TM\_FTA+TM\_TOV))} \]
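The usage rate formula above can be sketched as a small function. This is an illustrative Python sketch only (the report's analysis was carried out in R); the function and argument names are assumptions that mirror the symbols in the formula and are not taken from the project code.

```python
def usage_rate(fga, fta, tov, mp, tm_fga, tm_fta, tm_tov, tm_mp):
    """Usage rate (%) from player and team season totals.

    Argument names follow the symbols in the formula; they are
    illustrative and not taken from the project's R scripts.
    """
    player_plays = fga + 0.44 * fta + tov
    team_plays = tm_fga + 0.44 * tm_fta + tm_tov
    return 100 * (player_plays * (tm_mp / 5)) / (mp * team_plays)


# Hypothetical totals: a player who accounts for exactly one fifth of
# his team's plays while on the floor should come out at 20%.
example = usage_rate(fga=100, fta=50, tov=20, mp=48,
                     tm_fga=500, tm_fta=250, tm_tov=100, tm_mp=240)
print(round(example, 1))
```

The model later in this report appears to use Tm_use_total on a 0-1 scale (for example 0.2), in which case the percentage would be divided by 100 before being used as a predictor.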
Effective field goal percentage (eFGp)
Effective Field Goal Percentage (eFGp) is a statistic that adjusts field goal percentage to account for the fact that three-point field goals count for three points while other field goals count for only two. Its purpose is to equalise the field goal output percentage between two-point shooters and three-point shooters.
It is calculated by: \[ eFG(\%) =\frac{FG+(0.5*3P)}{FGA} \]
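To illustrate how the adjustment works, the sketch below (Python, with hypothetical shooting totals and an assumed function name) compares two players with the same raw field goal percentage, one of whom makes some of his shots from three-point range.

```python
def effective_fg_pct(fg, three_p, fga):
    """Effective field goal percentage as a proportion (0-1)."""
    if fga <= 0:
        raise ValueError("fga must be positive")
    return (fg + 0.5 * three_p) / fga


# Hypothetical shooters, both making 400 of 1000 attempts (40% raw FG%).
two_point_shooter = effective_fg_pct(fg=400, three_p=0, fga=1000)
three_point_shooter = effective_fg_pct(fg=400, three_p=200, fga=1000)
print(two_point_shooter, three_point_shooter)
```

The second player's made threes lift his effective percentage above his raw percentage, which is the equalising effect described above.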
Efficiency Value
Efficiency Value (EFF) is a metric invented by Martin Manley and is considered the first ever player evaluation metric; it indicates a player's linear efficiency.10
It is calculated by: \[ EFF = \frac{(PTS + TRB + AST + STL + BLK - (FGA-FG) - (FTA-FT) - TOV)}{GP} \]
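A minimal Python sketch of the efficiency calculation follows. The function name and the `per_game` switch are assumptions added for illustration: the formula above divides by games played, whereas the EFF values quoted later in this report (above 1500) are of a season-total magnitude, so both forms are shown.

```python
def efficiency(pts, trb, ast, stl, blk, fga, fg, fta, ft, tov, gp,
               per_game=True):
    """Linear efficiency (EFF) from season totals.

    per_game=True divides by games played, as in the formula;
    per_game=False returns the undivided season total (an assumption
    about how the larger values in the report may have been produced).
    """
    total = pts + trb + ast + stl + blk - (fga - fg) - (fta - ft) - tov
    return total / gp if per_game else total


# Hypothetical season line, not a real player.
line = dict(pts=2000, trb=500, ast=400, stl=100, blk=50,
            fga=1500, fg=700, fta=500, ft=400, tov=200, gp=75)
print(efficiency(**line))
print(efficiency(**line, per_game=False))
```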
Trade Value
Trade Value is an estimate, based on a player's age and his Approximate Value, of how much value a player has left in his career. It was invented by Bill James.11 12
It is calculated by: \[ TrV = \frac{(AV - (27-0.75*Age))^2 * (27-0.75*Age+1) * AV}{190} + AV * \frac{2}{13} \]
Approximate Value
Approximate Value (AV) is an estimate of a player's value, making no fine distinctions, but, rather, distinguishing easily between very good seasons, average seasons, and poor seasons13.
It is calculated by:
\[ AV = \frac{(Credits^{3/4})}{21} \]
Credits
Credits is an aggregation of observations from a standard game/season, used in combination within the Approximate Value calculation.
It is calculated by:
\[ Credits = (PTS)+(TRB)+(AST)+(STL)+(BLK)-(FGA-FG)-(FTA-FT)-(TOV) \]
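The Credits and Approximate Value steps can be chained as in the Python sketch below. The function names are assumptions for illustration, and the input totals are hypothetical.

```python
def player_credits(pts, trb, ast, stl, blk, fga, fg, fta, ft, tov):
    """Credits: positive box-score totals minus missed shots and turnovers."""
    return pts + trb + ast + stl + blk - (fga - fg) - (fta - ft) - tov


def approximate_value(credit_total):
    """Approximate Value: credits raised to the power 3/4, divided by 21."""
    if credit_total < 0:
        raise ValueError("credits must be non-negative for a fractional power")
    return credit_total ** 0.75 / 21


# Hypothetical season totals, not a real player.
c = player_credits(pts=2000, trb=500, ast=400, stl=100, blk=50,
                   fga=1500, fg=700, fta=500, ft=400, tov=200)
print(c, round(approximate_value(c), 2))
```

The 3/4 power compresses large credit totals, so Approximate Value grows more slowly than raw production.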
Total rebounds/minute played (TRB_MP)
The calculation of total rebounds per minute is simple, but it is essential for capturing active offensive and defensive rebound involvement on a per-minute-played basis, so that players with varying amounts of game time can be compared. Crucial influences on statistical analysis within basketball reference the importance of rebounding as one of the “Four Factors.”
It is calculated by:
\[ TRB\_MP = \frac{(ORB+DRB)}{MP} \]
Points/minute played (PTS_per_MP)
Points per minute was included in the analysis to accurately compare points across players. Per-minute rates were also calculated for other metrics, including steals, blocks, assists, turnovers etc., by taking the player’s total in the relevant metric and dividing by the total minutes played.
It is calculated by:
\[ PTS\_per\_MP = \frac{PTS}{MP} \]
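As described in the text, each per-minute rate is the player's season total in a metric divided by his total minutes played, with total rebounds taken as offensive plus defensive rebounds. A minimal Python sketch, using an assumed helper name and hypothetical totals:

```python
def per_minute(total, minutes_played):
    """Convert a season total into a per-minute-played rate."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return total / minutes_played


# Hypothetical season totals, not a real player.
pts, orb, drb, mp = 1800, 100, 400, 2500

pts_per_mp = per_minute(pts, mp)       # points per minute
trb_mp = per_minute(orb + drb, mp)     # total rebounds per minute
print(pts_per_mp, trb_mp)
```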
Team Win % (WinP_Tm)
To calculate winning percentage, the number of wins is divided by the number of games played. Team winning percentage was included in the model to explore the relationship between the individual player metrics and their contribution to a team's winning percentage.
It is calculated by: \[ Win\% = \frac{TM\_W}{TM\_G} \]
3. Reading and cleaning the raw data
This section details the process undertaken for the reading, cleaning and exporting of tidy data frames for further analysis. For information on how to replicate this project, please see the instructions in the hosted GitHub repo.
Reading and cleaning steps.
1. Data sources
www.basketball-reference.com Player statistics 2018-19 NBA season.14
www.hoopshype.com Player salaries/year in $USD.15
www.hoopshype.com Organisation/Team Payroll for 2018-19.16
basketball-reference.com Team statistics 2018-19 NBA season.17
2. Cleaning process
The following steps were carried out to ensure the data was clean and processable ready for analysis:
- Files in *.csv format were imported (“read”) into R.
- Distinguishable names assigned to each data frame.
- Error checking and identification of missing values across all data frames.
- Conversion of NA values (implicit to explicit).
- Removal of any empty columns.
- Comparison across data frames for common variable names.
- Fix spelling/abbreviations of player names to ensure accurate data matching across data sets.
- Fix team names to match abbreviations across data sets.
- Checking for errors and missing values within the data sets.
- Merging of data frames into one data set to allow comparison of variables.
- Identify duplicates and collapse/aggregate values to have a one season value for each player.
- Address any class issues due to merging i.e. numeric values are numeric etc
- Creation of variables at a rate of minutes played.
- Creation of equations and new variables for predictive model analysis.
- Separation of transfer players within the 18/19 season, identified as TOT players.
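The final cleaning step, separating in-season transfer players, can be sketched as follows. This is a Python illustration only, not the project's R code; it assumes each row carries `Player` and `Tm` fields and that a traded player has a combined row whose team is `TOT`, as the step above describes. The player names are placeholders.

```python
def split_transfer_players(rows):
    """Split season rows into non-transfer rows and combined TOT rows."""
    tot_names = {r["Player"] for r in rows if r["Tm"] == "TOT"}
    # Drop every row (TOT and per-team) belonging to a transferred player.
    non_tot = [r for r in rows if r["Player"] not in tot_names]
    tot = [r for r in rows if r["Tm"] == "TOT"]
    return non_tot, tot


# Placeholder rows, not real data.
rows = [
    {"Player": "Player A", "Tm": "CHI", "G": 70},
    {"Player": "Player B", "Tm": "TOT", "G": 60},
    {"Player": "Player B", "Tm": "HOU", "G": 25},
    {"Player": "Player B", "Tm": "PHI", "G": 35},
]
non_tot, tot = split_transfer_players(rows)
print([r["Player"] for r in non_tot], [r["Player"] for r in tot])
```

Keeping only the combined row for transferred players leaves one season value per player in each group.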
3. Tidy data frames exported
- The data frames were exported to *.csv files into a separate folder in the working directory.
- Data frames were imported/read in from clean *.csv files for further analysis via R scripts in R.
4. Exploratory analysis:
Exploratory steps
1. NBA player data base
NBA player list for 2018-19 season
2. Variable distribution
- Large distribution of variable values across the data sets.
- Several outliers exist with high points/min (and other stats) due to minimal game time/games.
- Some left-tailed skewness in distribution.
- Identified thresholds of outlier high points of influence/leverage to test in
Box plot of Points/min per position
Team Win % vs Points/min
Histogram distribution of Effective field goal %
3. Variable relationships:
Relationship between EFF and eFGp
Relationship between Points/min and Team Win %
Relationship between eFGp % and Team Win %
Relationship between Points/min and EFF
4. Linear model assessments
Searching for confounding variables and single linear model:
Relationship between points per min and team winning percentage:
There appears to be a linear relationship between points/min and winning percentage. As points/min increases, there is an increase in team winning percentage.
5. Correlation co-efficient PTS_per_MP and WinP_Tm
The correlation coefficient = 0.056, indicating only a very weak positive correlation between Points/min and team winning percentage, as the value is close to 0 rather than approaching 1.
| Correlation coefficient |
|---|
| 0.055602 |
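For reference, the sketch below shows how a Pearson correlation coefficient is computed and how it ties to the regression output in the next step: in a simple linear regression, the multiple R-squared is the square of this coefficient. It is a Python illustration with toy data, not the project's R code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy data, not taken from the NBA data set.
print(pearson_r([0.3, 0.4, 0.5, 0.6], [45.0, 52.0, 48.0, 55.0]))

# The reported coefficient squared matches the reported multiple R-squared.
print(round(0.055602 ** 2, 6))
```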
6. Simple Linear Regression
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 47.720858 | 2.850014 | 16.7440769 | 0.0000000 | 42.111684 | 53.33003 |
| PTS_per_MP | 5.853553 | 6.151282 | 0.9515989 | 0.3420875 | -6.252917 | 17.96002 |
Call:
lm(formula = WinP_Tm ~ PTS_per_MP, data = df_nonTOT_clean)
Residuals:
Min 1Q Median 3Q Max
-30.1688 -9.5159 0.5934 10.4431 23.5914
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.721 2.850 16.744 <2e-16 ***
PTS_per_MP 5.854 6.151 0.952 0.342
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.24 on 292 degrees of freedom
Multiple R-squared: 0.003092, Adjusted R-squared: -0.0003225
F-statistic: 0.9055 on 1 and 292 DF, p-value: 0.3421
The intercept coefficient = 47.7, meaning that when points per minute is 0, the expected team winning percentage = 47.7, which does not make much practical sense, but is a starting point for the model. The slope coefficient = 5.85, meaning that for every 1 unit increase in points per minute, the expected team winning percentage increases by 5.85; this slope is not statistically significant (p = 0.342). The multiple R-squared value = 0.003092, meaning that about 0.31% of the variance in team winning percentage is explained by points per minute (the adjusted R-squared is -0.0003225).
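Because the model formula is WinP_Tm ~ PTS_per_MP, the fitted line returns a predicted team winning percentage for a given points-per-minute rate. A small Python sketch using the rounded coefficients from the output above (the function name is an assumption):

```python
INTERCEPT = 47.721   # predicted WinP_Tm when PTS_per_MP = 0
SLOPE = 5.854        # change in predicted WinP_Tm per 1.0 PTS_per_MP


def predicted_win_pct(pts_per_mp):
    """Predicted team winning percentage from the simple linear model."""
    return INTERCEPT + SLOPE * pts_per_mp


# Two illustrative scoring rates.
for rate in (0.3, 0.7):
    print(rate, round(predicted_win_pct(rate), 2))
```

Across this range of scoring rates the prediction moves by only a few percentage points, which is consistent with the small R-squared reported.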
Call:
lm(formula = WinP_Tm ~ PTS_per_MP, data = df_nonTOT_clean)
Coefficients:
(Intercept) PTS_per_MP
47.721 5.854
7. Independence
The Durbin-Watson statistic = 0.01404402, which is far from the recommended value of 2, meaning that the assumption of independence is likely violated. However, this could be because the data set is filtered, the figures are drawn from across teams, and there is player movement/transfer between teams, all of which could influence independence.
lag Autocorrelation D-W Statistic p-value
1 0.9832565 0.01404402 0
Alternative hypothesis: rho != 0
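The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. It ranges from 0 to 4, with values near 2 indicating no first-order autocorrelation and values near 0 indicating strong positive autocorrelation. A minimal Python sketch with toy residuals:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in their row order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Toy residuals: alternating signs versus a slow drift.
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))
print(durbin_watson([1.0, 1.1, 1.2, 1.3]))
```

Because the statistic depends on row order, a data frame sorted by team or by winning percentage could produce a value near 0; whether that applies here is not stated in the report.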
8. Outlier identification and leverage points
Outliers
There do not appear to be any outliers, as all standardised residuals are less than 3.
Leverage points
There are no hat values greater than 1, however it will be useful to investigate the points above 0.025, as they appear to stand out from the rest of the values.
The points above 0.025 need to be investigated.
There are 5 points that could be influencing the model (1, 13, 45, 104, 200). The next step is to determine whether these points could be considered high influence.
The points above 0.015 that stand out above the rest also need to be investigated.
There are 11 points that could be influencing the model (1, 13, 14, 25, 45, 264, 269, 275, 283, 285, 288). This requires a further assessment of the model without the high-influence points.
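For a simple linear regression, the hat value of each observation can be computed directly from the predictor. The Python sketch below uses toy data and assumed function names; the threshold is passed in, mirroring the visually chosen cut-offs (0.025 and 0.015) used above. A common rule of thumb, not stated in this report, is to flag hat values above 2p/n, which for 2 coefficients and 294 observations would be about 0.0136.

```python
def hat_values(xs):
    """Leverage (hat) values for a one-predictor linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]


def high_leverage(xs, threshold):
    """1-based row numbers whose hat value exceeds the threshold."""
    return [i + 1 for i, h in enumerate(hat_values(xs)) if h > threshold]


# Toy points-per-minute values with one extreme observation.
xs = [0.3, 0.4, 0.5, 0.6, 1.5]
print([round(h, 3) for h in hat_values(xs)])
print(high_leverage(xs, threshold=0.5))
```

Observations far from the mean of the predictor receive the largest hat values, which is why players with unusually high points/min stand out.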
A re-run of the linear regression with filtered_df
# A tibble: 2 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 47.7 2.85 16.7 1.41e-44 42.1 53.3
2 PTS_per_MP 5.85 6.15 0.952 3.42e- 1 -6.25 18.0
Call:
lm(formula = WinP_Tm ~ PTS_per_MP, data = df_LinR_filtered)
Residuals:
Min 1Q Median 3Q Max
-30.1688 -9.5159 0.5934 10.4431 23.5914
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.721 2.850 16.744 <2e-16 ***
PTS_per_MP 5.854 6.151 0.952 0.342
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.24 on 292 degrees of freedom
Multiple R-squared: 0.003092, Adjusted R-squared: -0.0003225
F-statistic: 0.9055 on 1 and 292 DF, p-value: 0.3421
The removal of high influence points does not show any change to the intercept or to the PTS_per_MP coefficient.
However, visually, the plot without high influence points looks a lot cleaner and more linear.
The graph shows a more even spread and is less influenced by the high points.
9. Test for homoscedasticity within data set
A test for homoscedasticity shows that the assumption for homoscedasticity is upheld by plotting the residuals against the fitted values. As such, there does not appear to be evidence of heteroscedasticity.
10. Assessment of normality
Are the residuals normally distributed?
There appears to be some slight skewness, and the residuals do not look evenly distributed. This is likely from the points investigated for influence. These values did not appear to be influencing the results of the model. This left skewed tail could be due to the large spread in players' points scoring. A possible option is to collect more data, and there are potentially other factors that contribute to winning.
This simple linear regression shows only a weak, statistically non-significant positive association between PTS_per_MP and WinP_Tm in the NBA (p = 0.342). Further analysis of the variables eFGp, EFF, TRB_MP, TrV, and Tm_use_total is proposed to assess their influence on PTS_per_MP and therefore WinP_Tm in a multiple linear regression. With the exception of independence, the assumptions have been reasonably satisfied, with some understanding of the bias of the data set, and a multiple linear regression appears to be a robust statistical test to investigate the correlations in this data set. The decision to filter to 40 games for this linear regression was made in order to see what the most consistent players in the NBA were scoring, and how that influenced the Win %. This is important as we want players with a high team usage factor, and thus high scoring, to be influential at the Chicago Bulls.
11. Linear model assessment
How many more Points/min can be attained when controlling for other factors
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -0.3821332 | 0.0180836 | -21.1315032 | 0.0000000 | -0.4177259 | -0.3465405 |
| eFGp | 0.6985608 | 0.0314831 | 22.1884530 | 0.0000000 | 0.6365947 | 0.7605269 |
| TRB_MP | -0.0329833 | 0.0174981 | -1.8849604 | 0.0604419 | -0.0674237 | 0.0014572 |
| Tm_use_total | 2.3855803 | 0.0366744 | 65.0474831 | 0.0000000 | 2.3133963 | 2.4577642 |
| EFF | 0.0000096 | 0.0000043 | 2.2488204 | 0.0252797 | 0.0000012 | 0.0000181 |
| TrV | -0.0000080 | 0.0000104 | -0.7740553 | 0.4395330 | -0.0000284 | 0.0000124 |
The table above shows that for every 1 unit increase in eFGp, PTS_per_MP increases by 0.699. This is not exactly a practical example, as it would equate to a player shooting at greater than 100% for their effective field goal percentage and would then surpass the highest points scorer per minute at 0.98. As such, this number should be viewed as a rate indicator: if you could improve your team’s eFGp across the board by 0.20 (20 percentage points), you would see an increase of about 0.14 points/min (0.699 * 0.20 = 0.1398).
# A tibble: 6 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.382 0.0181 -21.1 1.68e- 60 -0.418 -0.347
2 eFGp 0.699 0.0315 22.2 2.73e- 64 0.637 0.761
3 TRB_MP -0.0330 0.0175 -1.88 6.04e- 2 -0.0674 0.00146
4 Tm_use_total 2.39 0.0367 65.0 3.24e-174 2.31 2.46
5 EFF 0.00000965 0.00000429 2.25 2.53e- 2 0.00000120 0.0000181
6 TrV -0.00000803 0.0000104 -0.774 4.40e- 1 -0.0000284 0.0000124
12. Justification for data modelling
In an effort to determine how much impact players have on their teams, the use of multiple developed metrics such as Usage Percentage, Efficiency Rate, Effective Field Goal Percentage and Trade Value, combined with hard per-minute data (i.e. Points per minute and Total Rebounds per minute), gives a well rounded assessment of a player's value and future worth to an organisation7.
By selecting players who featured in more than 40 games in the season and who were above select values in the selected key metrics, it is hypothesized that an increase in points/min could be achieved, which is associated with an increased win percentage.
As such, choosing players that contribute to increasing the team points per minute average, would equate to increased team wins, with a goal of achieving greater than 42 wins or 50% win percentage and progressing to playoffs.
Players who were transferred during the season were excluded from the analysis and predictive model. The basis of this decision was that, for the purposes of finding the best value players, it was important to see which players contributed regularly to the overall team Win % and were consistently playing in the NBA.
Players with greater than 40 NBA games in 18/19 and who are non-transfer players:
5. Data modelling:
This section covers:
- Player data analysis:
- Data modeling
- Creating a multiple linear regression model,
- Assumption checking
- Model output and interpretation
The table below shows the non transfer NBA player group filtered to the select variables chosen for our analysis/model.
The selected variables are:
- Position
- Salary
- Age
- Team
- Usage %
- EFF Rate
- Trade Value
- eFG %
- Points/min
- Expected points/min
- Total rebounds/min
- Team Win %
NBA player group
Multiple regression
Assumption testing
Added-Variable Plots
Pairs Plots
Multicollinearity occurs when two or more of your explanatory variables are highly related with each other. It can lead to changes in the coefficient estimates and confusion around which variable is explaining the variance in the response variable. As such the multicollinearity undermines the statistical significance of an independent variable.
Below is a visual test for multicollinearity using a pairs plot.
Variance inflation factor
Variance Inflation Factor18 was assessed to identify the correlation between predictors (i.e. independent variables) in a model; its presence can adversely affect your regression results. The VIF estimates how much the variance of a regression coefficient is inflated due to the variables being too alike/related to each other.
| Variance inflation factor | |
|---|---|
| eFGp | 1.428056 |
| TRB_MP | 1.265889 |
| Tm_use_total | 2.282247 |
| EFF | 3.430486 |
| TrV | 1.844460 |
As all of our values are between 1 and 5, it is safe to say that there is some, but not severe, correlation between them.
The predictors used showed some “moderate correlation” when tested for multicollinearity. This is explainable due to the aggregated nature of some of the statistics used, i.e. similar variables were used across and within each metric.
Square root of VIF:
The square root of the VIF indicates how many times larger the standard error is compared to the scenario in which that variable had 0 correlation with the other predictors. From the table below, all values are between 1.1 and 1.9, indicating only a modest inflation of the standard errors. One way to reduce the standard errors is to obtain more data across multiple seasons, which would produce more precise coefficient estimates.
| Square root of VIF | |
|---|---|
| eFGp | 1.195013 |
| TRB_MP | 1.125117 |
| Tm_use_total | 1.510711 |
| EFF | 1.852157 |
| TrV | 1.358109 |
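The second table can be reproduced from the first by taking the square root of each variance inflation factor. A short Python check using the VIF values reported above:

```python
import math

# VIF values as reported in the variance inflation factor table.
vif = {
    "eFGp": 1.428056,
    "TRB_MP": 1.265889,
    "Tm_use_total": 2.282247,
    "EFF": 3.430486,
    "TrV": 1.844460,
}

# Square root of each VIF: the factor by which the coefficient's
# standard error is inflated relative to uncorrelated predictors.
sqrt_vif = {name: math.sqrt(value) for name, value in vif.items()}

for name, value in sqrt_vif.items():
    print(f"{name}: {value:.6f}")
```

For example, the standard error of the EFF coefficient is roughly 1.85 times what it would be if EFF were uncorrelated with the other predictors.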
Model output and interpretation
Linear regression and assessment of fit shows the comparison between the predicted/expected values and the actual season observed values. Points above the line = under estimated, below the line = over estimated.
Assessment of fit
Assessment of fit for Player Points/min
Comparison of Actual vs Expected Team Points/min vs Win (%)
Predictive formula for Points/minute
Predictive model for line of best fit
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -0.3821332 | 0.0180836 | -21.1315032 | 0.0000000 | -0.4177259 | -0.3465405 |
| eFGp | 0.6985608 | 0.0314831 | 22.1884530 | 0.0000000 | 0.6365947 | 0.7605269 |
| TRB_MP | -0.0329833 | 0.0174981 | -1.8849604 | 0.0604419 | -0.0674237 | 0.0014572 |
| Tm_use_total | 2.3855803 | 0.0366744 | 65.0474831 | 0.0000000 | 2.3133963 | 2.4577642 |
| EFF | 0.0000096 | 0.0000043 | 2.2488204 | 0.0252797 | 0.0000012 | 0.0000181 |
| TrV | -0.0000080 | 0.0000104 | -0.7740553 | 0.4395330 | -0.0000284 | 0.0000124 |
A predictive formula based off multiple regression model:
Using the approximate values below, drawn from the exploratory analysis:
- eFGp = 0.55
- TRB_MP = 0.2
- Tm_use_total = 0.2
- EFF = 1500
- TrV = 600
\[ \beta_1 = -0.382 + 0.699 * 0.55 - 0.0330 * 0.2 + 2.39 * 0.20 + 0.00000965 * 1500 - 0.00000803 * 600 \]
| Points/min |
|---|
| 0.483507 |
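The worked value above can be reproduced with a short function. This Python sketch uses the rounded coefficients from the regression table; the function name and dictionary layout are assumptions for illustration.

```python
# Rounded coefficients from the multiple regression table.
COEFFICIENTS = {
    "intercept": -0.382,
    "eFGp": 0.699,
    "TRB_MP": -0.0330,
    "Tm_use_total": 2.39,
    "EFF": 0.00000965,
    "TrV": -0.00000803,
}


def predicted_pts_per_mp(efgp, trb_mp, tm_use_total, eff, trv):
    """Expected points per minute from the fitted multiple regression."""
    c = COEFFICIENTS
    return (c["intercept"]
            + c["eFGp"] * efgp
            + c["TRB_MP"] * trb_mp
            + c["Tm_use_total"] * tm_use_total
            + c["EFF"] * eff
            + c["TrV"] * trv)


# Threshold values listed above.
print(round(predicted_pts_per_mp(0.55, 0.2, 0.2, 1500, 600), 6))
```

Most of the predicted value comes from the eFGp and Tm_use_total terms; the EFF and TrV terms contribute only about 0.01 in total at these inputs.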
The above selected figures were taken from the exploratory analysis and gauged on a worst case scenario of player recruitment, with the minimum goal of achieving a points/min ratio that would equate to a Win % above 50.
6. Results:
The predictive model hypothesized and created above showed increasing association throughout the model building process. Albeit not a direct correlation, there is evidence of a repeated positive association when predicting a points/minute rate in NBA players/games. The graph below shows an analysis of points expected and actual points observed compared to the Win % of each team. It also shows the projected 50% Win line, highlighting the teams who consistently produced winning results and their respective selected variables.
Comparison of Actual vs Expected Team Points/min vs Win (%)
Actual observed points = Black, Expected/Predicted = Blue
7. Player Analysis and Recommendations:
Below is an analysis of teams' winning percentages and their corresponding points/minute rates. Given Chicago’s finishing position (bottom left corner of the graph below) in both the conference and the overall Win-Loss ratio, the task of recommending 5 new starting players is essential.
The graphs below are an interactive representation of the players and their respective metrics utilised within our predictive model.
Points per/min vs Win percentage:
Player vs Salary analysis:
Trade Value vs Salary:
Efficiency rate vs Points/min:
Player pool for selection.
The players presented below both individually and collectively will see an increase in the Chicago Bulls points per minute rate and in doing so give the organisation the best chance to consistently play at above a 50% Win rate to progress through to playoffs. The players
Recommended players for the Chicago Bulls 2019-2020 season:
Key figures of the listed players:
- SG = Devin Booker (PHO):
- Highest points/min and points/game
- Efficiency rating >1700
- Trade Value >700
- Utilised >30% in Team usage
- Salary = 3.3 Million
- SF = Kawhi Leonard (TOR):
- Highest PTS_per_MP for role
- Efficiency rating >1700
- Trade Value >600
- High quantity of Games and Games Started with an eFGp of 54%
- Good age for Small Forward
- Salary = 23 Million - Identified as marquee player
- PG = Ben Simmons (PHI):
- Highest Trade Value
- Efficiency rating >2000
- Utilised 22% usage rating
- Above 50% eFGp = 0.56
- Scoring at good rate combined with high efficiency rate for role, PTS_per_MP = 0.49
- Salary = 6.4 Million
- PF = Julius Randle (NOP):
- Started more than 50% of games and played in 73 overall
- Ranked 4th for points/min for position, PTS_per_MP = 0.70
- 3rd highest Efficiency for role = 1873
- 3rd highest Trade Value = 695.8854
- eFGp of 55%
- Salary = 8.6 Million
- Utilised >21% usage rating
- Top 10 for total rebounds per minute for PF role = 0.28
- C = Karl-Anthony Towns (MIN):
- Trade Value >700 = 3rd highest for position
- Efficiency rating, highest for position >1700
- Utilised >0.21 usage rate
- Above average eFGp = >0.58
- 3rd highest for Points/min >0.73 and >24 PTS/game
- Salary = 7.8 Million
The combined cost of the above-mentioned 5 players equals:
| Total Cost |
|---|
| 49343386 |
The amount required to purchase 5 new starting players ($49,343,386 USD) leaves a residual amount of $68,656,614 USD in the budget to purchase/retain the remaining 10 players needed for a full NBA roster for the 2019-20 season. On average, the residual equates to approximately 6.87 million USD per player, which, given the analysis provided, is ample to continue to field a robust and extremely competitive new team for the Chicago Bulls organisation.
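The budget arithmetic can be checked with a few lines of Python, using only the figures stated in this report (the $118 million budget, the total cost of the five players and the 15-player roster size):

```python
BUDGET = 118_000_000          # allotted budget in USD
STARTING_FIVE_COST = 49_343_386
ROSTER_SIZE = 15
STARTERS = 5

# Budget left after the five recommended players.
remaining = BUDGET - STARTING_FIVE_COST

# Average amount available for each of the other roster spots.
per_remaining_player = remaining / (ROSTER_SIZE - STARTERS)

print(remaining, per_remaining_player)
```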
8. Limitations:
Albeit satisfying the majority of assumptions within acceptable levels, there are inherent biases within this project/model. Several teams throughout their season achieved greater than predicted scoring ratios as such highlighting that there are elements of game play that have not been accurately recorded/reported, in combination with injuries/illnesses that may affect the actual starting line-up of a team, these factors amongst others may be a contribution to varying of results.
There is inherent bias present within the predictive model; utilising explanatory variables that demonstrate correlation carries with it the dependency of execution of said trend, i.e. if a player is out of favour with a coach and not seeing game time can therefore not perform (for example the transfer players), or a certain player has a dependency on another player delivering him the ball in his key position will impact on the modeling and analysis.19
An element of survivorhsip bias is present, as the NBA hosts the best of the best players in the world and then using numeric trends to separate them could higlight a lack of independence of data/variables.
The selection of players who played multiple positions and for multiple teams during the 2018/2019 season were excluded from the analysis. The data set was filtered to display players who had played for 40 or more games, which is just under 50% of the games for the season, as it was evident that the better performing players in each position played the vast majority of the 82 games of the season.
Lastly, the inclusion of some team-specific factors within metrics could influence the perception of an individual player’s performance. For example, a really good player whose usage rating is high may have a lower efficiency or poor eFG% due to lack of possession of the ball in scoring opportunities20. Conversely, you may have an average player in a very good team. An attempt was made to address this limitation by converting the statistics to per-minute-played rates, to aid in the reduction of bias.
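As an illustration of the per-minute conversion, the sketch below divides season totals by minutes played. It is written in Python rather than the project's R, the helper names are hypothetical, and the `_MP` suffix is borrowed from the glossary pattern for per-minute statistics. The sample figures are made up.

```python
def per_minute(stat_total, minutes_played):
    """Return a season total expressed per minute played, or None if MP is 0."""
    if minutes_played <= 0:
        return None  # a rate is undefined for a player with no minutes
    return stat_total / minutes_played


def add_per_minute_columns(row, stats=("PTS", "TRB", "AST")):
    """Copy a player row and add a '<stat>_MP' entry for each chosen statistic."""
    out = dict(row)
    for stat in stats:
        out[f"{stat}_MP"] = per_minute(row[stat], row["MP"])
    return out


# Made-up season totals for illustration only.
player = {"player_name": "Player A", "MP": 2400, "PTS": 1200, "TRB": 600, "AST": 300}
rated = add_per_minute_columns(player)
print(rated["PTS_MP"], rated["TRB_MP"], rated["AST_MP"])
```

Expressing totals per minute lets a bench player with limited minutes be compared with a starter on the same scale, which is the bias reduction the paragraph above refers to.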
9. Summary:
This project highlighted several trends within the NBA data and the NBA overall standings results. This mode of retrospective/prospective analysis still relies on the game-based execution of set actions/reactions. This can be seen in the confidence intervals for each predictive variable, which show the margin for difference between expected and observed results.
As mentioned before, Dean Oliver refers to the “Four Factors” of basketball, adding that metrics/ratings can be broken down into four elements of the game:
- Shooting
- Turnovers
- Rebounding, and
- Getting to the foul line
These four elements, or “Four Factors,” allow a strategic framework of understanding to be extracted from the game.
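For reference, the Four Factors are usually quantified with the formulas sketched below. This is a Python illustration rather than the project's R code, and the function names are hypothetical. The 0.44 free-throw weight in the turnover formula follows the widely published convention, and the offensive rebound percentage needs the opponent's defensive rebounds, which the team data files described in the glossary do not list. The input numbers are made up.

```python
def effective_fg_pct(fg, three_p, fga):
    """Shooting: field goal percentage with three-pointers weighted 1.5x."""
    return (fg + 0.5 * three_p) / fga


def turnover_pct(tov, fga, fta):
    """Turnovers: estimated turnovers per 100 plays."""
    return 100 * tov / (fga + 0.44 * fta + tov)


def offensive_rebound_pct(orb, opp_drb):
    """Rebounding: share of available offensive rebounds that were secured."""
    return 100 * orb / (orb + opp_drb)


def ft_per_fga(ft, fga):
    """Getting to the foul line: free throws made per field goal attempt."""
    return ft / fga


# Made-up team season totals for illustration only.
team = {"FG": 3200, "3P": 800, "FGA": 7000, "FT": 1400, "FTA": 1800,
        "TOV": 1100, "ORB": 800, "OPP_DRB": 2800}

efg = effective_fg_pct(team["FG"], team["3P"], team["FGA"])
tov = turnover_pct(team["TOV"], team["FGA"], team["FTA"])
orb = offensive_rebound_pct(team["ORB"], team["OPP_DRB"])
ftr = ft_per_fga(team["FT"], team["FGA"])

print(f"eFG% {efg:.3f} | TOV% {tov:.1f} | ORB% {orb:.1f} | FT/FGA {ftr:.3f}")
```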
It is in this framework that I believe that using a multifaceted approach to the analysis of player performance decreases the disparity between observed results and predictive results.
This method of analysis addresses the problem of inflated market values for athletes: it provides a way to see through them and highlight the true value of players based on their repeated habits and trends of play. I believe that the predictive formula created in this analysis can provide valuable insight into the real value and contribution players are making/could make in a new team.
10. Glossary:
NBA standard terms:
Project specific:
- Pos = Position
- Tm = Team, abbreviated to three letters, e.g. Chicago = CHI, Houston = HOU etc.
- ‘…_MP’ = Statistic at a per minute rate
- ‘TM_…’ = Statistic as a team total
- Tm_use_total = Usage Rate: the team’s total usage, expressed as a percentage across the total number of minutes played
- TrV = Trade Value as an estimation of athlete value taking into account athlete age and game based statistics
- eFGp = Effective Field Goal Percentage; allows comparison of 2 and 3 point shooters
- EFF = Efficiency Value is an indicator of an athlete’s linear efficiency
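As a point of reference for the EFF entry, the sketch below shows the commonly published linear efficiency formula: positive contributions minus missed shots and turnovers. It is a Python illustration under that assumption; whether the project applied it to season totals, per game, or per minute is not stated in this section. The function name and sample figures are made up.

```python
def efficiency(pts, trb, ast, stl, blk, fg, fga, ft, fta, tov):
    """Linear efficiency: positive contributions minus misses and turnovers."""
    missed_fg = fga - fg   # missed field goals
    missed_ft = fta - ft   # missed free throws
    return (pts + trb + ast + stl + blk) - (missed_fg + missed_ft + tov)


# Made-up season totals for illustration only.
eff_total = efficiency(pts=1500, trb=500, ast=300, stl=80, blk=40,
                       fg=550, fga=1200, ft=300, fta=380, tov=150)

minutes_played = 2400
eff_per_minute = eff_total / minutes_played

print(eff_total, round(eff_per_minute, 3))
```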
Data frame specific:
2018-19_nba_player-statistics.csv
This data file provides total statistics for individual NBA players during the 2018-19 season.
The variables consist of:
- player_name : Player Name
- Pos : (PG = point guard, SG = shooting guard, SF = small forward, PF = power forward, C = center)
- Age : Age of Player at the start of February 1st of that season.
- Tm : Team
- G : Games
- GS : Games Started
- MP : Minutes Played
- FG : Field Goals
- FGA : Field Goal Attempts
- FG% : Field Goal Percentage
- 3P : 3-Point Field Goals
- 3PA : 3-Point Field Goal Attempts
- 3P% : FG% on 3-Pt FGAs
- 2P : 2-Point Field Goals
- 2PA : 2-point Field Goal Attempts
- 2P% : FG% on 2-Pt FGAs
- eFG% : Effective Field Goal Percentage
- FT : Free Throws
- FTA : Free Throw Attempts
- FT% : Free Throw Percentage
- ORB : Offensive Rebounds
- DRB : Defensive Rebounds
- TRB : Total Rebounds
- AST : Assists
- STL : Steals
- BLK : Blocks
- TOV : Turnovers
- PF : Personal Fouls
- PTS : Points
- NB: Players that were traded during the season may appear more than once (on more than one row) so it is important to handle these duplications appropriately.
2018-19_nba_player-salaries.csv
This data file contains the salary for individual players during the 2018-19 NBA season.
The variables consist of:
- player_id : unique player identification number
- player_name : player name
- salary : year salary in $USD
This data file contains the team payroll budget for the 2019-20 NBA season.
The variables consist of:
- team_id : unique team identification number
- team : team name
- salary : team payroll budget in 2019-20 in $USD
2018-19_nba_team-statistics_1.csv
This data file contains miscellaneous team statistics for the 2018-19 season.
The variables consist of:
- Rk : Rank
- Age : Mean Age of Player at the start of February 1st of that season.
- W : Wins
- L : Losses
- PW : Pythagorean wins, i.e., expected wins based on points scored and allowed
- PL : Pythagorean losses, i.e., expected losses based on points scored and allowed
- MOV : Margin of Victory
- SOS : Strength of Schedule; a rating of strength of schedule. The rating is denominated in points above/below average, where zero is average.
- SRS : Simple Rating System; a team rating that takes into account average point differential and strength of schedule. The rating is denominated in points above/below average, where zero is average.
- ORtg : Offensive Rating; An estimate of points produced (players) or scored (teams) per 100 possessions
- DRtg : Defensive Rating; An estimate of points allowed per 100 possessions
- NRtg : Net Rating; an estimate of point differential per 100 possessions.
- Pace : Pace Factor: An estimate of possessions per 48 minutes
- FTr : Free Throw Attempt Rate; Number of FT Attempts Per FG Attempt
- 3PAr : 3-Point Attempt Rate; Percentage of FG Attempts from 3-Point Range
- TS% : True Shooting Percentage; A measure of shooting efficiency that takes into account 2-point field goals, 3-point field goals, and free throws.
- eFG% : Effective Field Goal Percentage; This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.
- TOV% : Turnover Percentage; An estimate of turnovers committed per 100 plays.
- ORB% : Offensive Rebound Percentage; An estimate of the percentage of available offensive rebounds a player grabbed while he was on the floor.
- FT/FGA : Free Throws Per Field Goal Attempt
- DRB% : Defensive Rebound Percentage
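The PW and PL variables above are based on the Pythagorean expectation, which can be sketched as follows. This Python illustration assumes an exponent of 14, a value commonly used for basketball; the exponent behind the data file is not stated in this report. The function names and point totals are hypothetical.

```python
def pythagorean_wins(points_for, points_against, games=82, exponent=14.0):
    """Expected wins from points scored and allowed over a season."""
    if points_for <= 0:
        raise ValueError("points_for must be positive")
    # Working with the ratio avoids raising very large totals to a high power.
    ratio = points_against / points_for
    win_pct = 1.0 / (1.0 + ratio ** exponent)
    return games * win_pct


def pythagorean_losses(points_for, points_against, games=82, exponent=14.0):
    """Expected losses: the games not accounted for by expected wins."""
    return games - pythagorean_wins(points_for, points_against, games, exponent)


# Made-up per-game scoring averages for illustration only.
pw = pythagorean_wins(110.0, 105.0)
pl = pythagorean_losses(110.0, 105.0)
print(f"Expected record: {pw:.1f} wins, {pl:.1f} losses")
```

Comparing W with PW for each team gives a quick view of which teams finished above or below what their scoring margin alone would predict.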
2018-19_nba_team-statistics_2.csv
This data file contains total team statistics for the 2018-19 NBA season.
The variables consist of:
- Team : Team name
- Rk : Ranking
- MP : Minutes Played
- G : Games
- FG : Field Goals
- FGA : Field Goal Attempts
- FG% : Field Goal Percentage
- 3P : 3-Point Field Goals
- 3PA : 3-Point Field Goal Attempts
- 3P% : FG% on 3-Pt FGAs
- 2P : 2-Point Field Goals
- 2PA : 2-point Field Goal Attempts
- 2P% : FG% on 2-Pt FGAs
- FT : Free Throws
- FTA : Free Throw Attempts
- FT% : Free Throw Percentage
- ORB : Offensive Rebounds
- DRB : Defensive Rebounds
- TRB : Total Rebounds
- AST : Assists
- STL : Steals
- BLK : Blocks
- TOV : Turnovers
- PF : Personal Fouls
- PTS : Points
Dr. Jocelyn Mara: Data Analysis in Sport PG21, University of Canberra, 2021
Martin Manley: Kansas City sports reporter and statistician, EFF calculation.
Bill James: Statistician, Trade Value calculation, Approximate Value calculation, Credits Calculation.
John Hollinger: Effective Field Goal percentage and Usage Rate calculation
Dean Oliver: Effective Field Goal percentage and Usage Rate calculation
Basketball-reference.com
Chicago Bulls Logo
This project was designed and built through RStudio, Version 1.4.1103, © 2009-2021 RStudio, PBC